Add XML invalid chars filter instead punctuation interrupt #136

Open: wants to merge 1 commit into base: master

mrk-andreev

I'm trying to export models that detect programming languages from input strings, so I need terms that contain special symbols like { or :, but the current implementation doesn't allow this. Moreover, it lets me create invalid XML documents when the vocabulary contains invalid XML characters. I suggest replacing the current check, which refuses to export models whose terms contain punctuation, with one that filters out invalid XML characters instead.

package com.sun.org.apache.xml.internal.utils;

public class XMLChar {

    // ... (CHARS table, MASK_VALID and other members omitted from this excerpt)

    /**
     * Returns true if the specified character is valid. This method
     * also checks the surrogate character range from 0x10000 to 0x10FFFF.
     * <p>
     * If the program chooses to apply the mask directly to the
     * <code>CHARS</code> array, then they are responsible for checking
     * the surrogate character range.
     *
     * @param c The character to check.
     */
    public static boolean isValid(int c) {
        return (c < 0x10000 && (CHARS[c] & MASK_VALID) != 0) ||
               (0x10000 <= c && c <= 0x10FFFF);
    }
}
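
To make the proposed filtering concrete, here is a minimal Python sketch of the XML 1.0 `Char` production. It assumes that `isValid` above encodes the same ranges (the `CHARS` table is not shown in this thread). The names `is_valid_xml_char` and `strip_invalid_xml_chars` are illustrative only and are not part of JPMML-SparkML or the JDK.

```python
def is_valid_xml_char(cp: int) -> bool:
    """Return True if code point `cp` matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # BMP below the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # BMP above it, minus U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def strip_invalid_xml_chars(text: str) -> str:
    """Drop every character that XML 1.0 does not allow in a document."""
    return "".join(ch for ch in text if is_valid_xml_char(ord(ch)))
```

Under these rules, punctuation such as { and : passes through unchanged, while a NUL (0x0) is dropped.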

@vruusmann
Member

I need special symbols like { or : but current implementations don't allow this.

Can you describe the pipeline configuration that feeds into this CountVectorizer stage? Is there any text pre-processing going on, and how is the text tokenized?

The point is that the JPMML-SparkML library follows the PMML specification when deciding what can and cannot be allowed. Punctuation characters can be significant (depending on the text tokenization mode), so they cannot be enabled or disabled at will.

Moreover it allows me to create invalid xml documents when I have invalid XML Characters.

Care to provide an example about this behaviour?

It should be the case that JPMML-SparkML populates an org.dmg.pmml.PMML object with whatever strings it pleases, and in the end this object is marshalled to a PMML XML document using standard JAXB technology.

You're basically claiming that the JAXB marshaller is somehow misbehaving (e.g. not encoding/escaping some characters).
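
As a point of comparison only, the sketch below shows how a serializer can escape markup characters yet still emit a character that no XML parser will accept. It uses Python's `xml.etree.ElementTree`, not JAXB, so it says nothing definitive about what the JAXB marshaller does. The helpers `serialize_field` and `parses` are hypothetical names made up for this illustration.

```python
import xml.etree.ElementTree as ET


def serialize_field(name: str) -> bytes:
    """Serialize a single element whose "name" attribute holds `name`."""
    elem = ET.Element("DerivedField", {"name": name})
    return ET.tostring(elem)


def parses(data: bytes) -> bool:
    """Return True if `data` can be parsed back as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False
```

With this serializer, a name such as `a{b:c` round-trips, and `<` is escaped to `&lt;`, but a name containing a NUL is written out as-is and then fails to parse. Whether JAXB behaves the same way would need to be checked against the attached model.pmml.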

@mrk-andreev
Author

In this example I use a patched CountVectorizerModelConverter that replaces every vocabulary term containing punctuation with "-":

for(int i = 0; i < vocabulary.length; i++){
    String term = vocabulary[i];

    if(TermUtil.hasPunctuation(term)){
        result.add(new TermFeature(encoder, defineFunction, documentFeature, "-"));
    } else {
        result.add(new TermFeature(encoder, defineFunction, documentFeature, term));
    }
}

patched-pmml.zip
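
For readers without the patched JARs, the effect of that loop can be sketched in Python. The implementation of `TermUtil.hasPunctuation` is not shown in this thread, so `has_punctuation` below is a hypothetical stand-in that assumes "punctuation" means any Unicode general category starting with `P`. The names `patch_vocabulary` and `PLACEHOLDER` are likewise made up for the illustration.

```python
import unicodedata

PLACEHOLDER = "-"  # same replacement string as in the patched converter


def has_punctuation(term: str) -> bool:
    """Hypothetical stand-in for TermUtil.hasPunctuation (assumed semantics)."""
    return any(unicodedata.category(ch).startswith("P") for ch in term)


def patch_vocabulary(vocabulary):
    """Replace each term that contains punctuation with the placeholder."""
    return [PLACEHOLDER if has_punctuation(term) else term for term in vocabulary]
```

For example, `patch_vocabulary(["def", "{", "a:b", "while"])` gives `["def", "-", "-", "while"]`. Every punctuated term ends up with the same "-" value, so this looks like a workaround to get an export through rather than a faithful translation of the vocabulary.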

I put these JARs into the PySpark jars directory (venv/lib/python3.8/site-packages/pyspark/jars) and use pyspark2pmml to export the model:

from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from pyspark2pmml import PMMLBuilder

spark = SparkSession.builder.master('local[*]').getOrCreate()
pm = PipelineModel.load('./model.bin')
pmmlBuilder = PMMLBuilder(spark.sparkContext, spark.createDataFrame([('', '')], schema=['lang', 'content']), pm)
pmmlBuilder.buildFile('./model.pmml')

(Remove the .zip suffix from the .z01 and .z02 parts; it was only added for upload.)
parts_model.bin.zip
parts_model.bin.z01.zip
parts_model.bin.z02.zip

The exported PMML file contains invalid XML:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.File;

public class Main {
    public static void main(String[] args) throws Exception {
        DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        parser.parse(new File("./model.pmml"));
    }
}
[Fatal Error] model.pmml:5444:29: An invalid XML character (Unicode: 0x0) was found in the value of attribute "name" and element is "DerivedField".
Exception in thread "main" org.xml.sax.SAXParseException; systemId: file:./model.pmml; lineNumber: 5444; columnNumber: 29; An invalid XML character (Unicode: 0x0) was found in the value of attribute "name" and element is "DerivedField".
	at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:262)
	at java.xml/com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:342)
	at java.xml/javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:206)
	at ai.conundrum.Main.main(Main.java:10)
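
The parser stops at the first offending character. To list all of them, a small scan along the following lines may help. This is a sketch that assumes the file is read as text (for example as UTF-8) and that validity follows the XML 1.0 `Char` production. `find_invalid_xml_chars` is an illustrative name, and its column numbers count characters within a line, which may not match the column Xerces reports.

```python
def find_invalid_xml_chars(text: str):
    """Return (line, column, code point) for each character XML 1.0 forbids."""
    hits = []
    # Split on "\n" only: str.splitlines() would also split on control
    # characters such as 0x0B and 0x0C, which are themselves invalid in XML.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            cp = ord(ch)
            valid = (
                cp in (0x9, 0xA, 0xD)
                or 0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF
            )
            if not valid:
                hits.append((line_no, col_no, cp))
    return hits
```

Calling it on the contents of model.pmml should report the 0x0 near line 5444, along with any other invalid characters further down the file.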

(Remove the .zip suffix from the .z01, .z02 and .z03 parts; it was only added for upload.)
parts_model.pmml.z01.zip
parts_model.pmml.z02.zip
parts_model.pmml.z03.zip
parts_model.pmml.zip

In vim, this part of the file (model.pmml:5444) looks like:

image
